Deviation of Zipf's and Heaps' Laws in Human Languages with Limited Dictionary Sizes

نویسندگان

  • Linyuan Lü
  • Zi-Ke Zhang
  • Tao Zhou
چکیده

Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmical form, and eventually saturates. A theoretical model for writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding about Zipf's and Heaps' laws in human language systems.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems

BACKGROUND Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interests, these two laws often appear together. Many theoretical models and analyses are performed to understand their co-occurrence in real systems, but it still lacks a clear picture about their relation. METHODOLOGY/PRINCIPAL FINDINGS We show that the Heaps' law can be considered as a derivative ...

متن کامل

Evolution of Scaling Emergence in Large-Scale Spatial Epidemic Spreading

BACKGROUND Zipf's law and Heaps' law are two representatives of the scaling concepts, which play a significant role in the study of complexity science. The coexistence of the Zipf's law and the Heaps' law motivates different understandings on the dependence between these two scalings, which has still hardly been clarified. METHODOLOGY/PRINCIPAL FINDINGS In this article, we observe an evolutio...

متن کامل

Facts about the Thai Web

This paper presents the latest status of Thai web servers. Quantitative measurements are based on database crawling on July 2000. Our experiment shows that the Heaps’ and Zipf's laws apply strongly to documents on the Thai web. A visualization tool is developed to show servers connectivity.

متن کامل

Do neural nets learn statistical laws behind natural language?

The performance of deep learning in natural language processing has been spectacular, but the reasons for this success remain unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of its effectiveness and of a limitation of neural networks for language engineering. Precisely, we demonstrate that a neural language model based on long short-term memor...

متن کامل

Comments on "linguistic features in eukaryotic genomes"

Tsonis and Tsonis [1] study rank-ordered distributions of the number of occurrences of protein domains in four different organisms, and they argue that the power-law decay, f ϰ 1/r, of the number f of occurrences of a protein domain with its rank r suggests the presence of linguistic features in eukaryotic genomes, and that this finding " may lead to important clues about the evolution of langu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 3  شماره 

صفحات  -

تاریخ انتشار 2013